Making Web Results Relevant with SAS®

Authors

  • Russell Albright
  • Jake Bartlett
  • David Bultman
Abstract

Many companies search the Web to learn about their competition and understand their potential customers. But how accurate are these search results? For instance, have you ever submitted the query "SAS", only to get results back about "Scandinavian Airline Systems"? This paper presents a SAS-based solution to accessing and clustering Yahoo! search engine results by using SAS Text Miner. We demonstrate how to use matrix factorization techniques, clustering algorithms, and visualizations to discriminate between subsets of documents that are returned as the result of a query.

INTRODUCTION

Searches, whether they are applied to Web pages, customer comments, or employee surveys, are the primary way for users to understand and navigate large, unexplored document collections. Search results obtained from submitted key words and phrases provide a perspective on the collection from which we can learn and discover the content that is available. However, these queries can only go so far in returning documents that are relevant to the researcher. The number of documents returned can be more than what is manageable, and the result set provided by the search engine may still be largely heterogeneous even though each returned document contains the query term(s).

Most search engines such as Yahoo! and Google return the results of a query as a list of Web pages that are ranked by their relevancy to the query. No information is provided about the relationships between the elements in the list, and no feedback is provided about the scope of the content contained in the result set. Because of this, many attempts have been made to cluster the search results and label each group in a useful way. The advantages of clustering the search results have been documented (Zamir and Etzioni 1998) and they include:

1. Clusters provide users with an immediate global view of the result set.
2. They allow users to navigate the information in a reasonable way.
3. They provide immediate negative feedback if the query missed the mark retrieving relevant documents.

Refining a search by adding keywords does not eliminate the benefit of clustering the search results. For instance, when a user submits a general term such as “sas” to a search engine, results can include SAS Institute, Scandinavian Airlines, and Radisson SAS Hotels, just to name a few. The results seem very diverse. If the user refines the query by searching on the two terms “sas” and “software”, the results will be more focused but, relative to the new result set, ambiguity will remain. The results of the refined query can be segmented into those that are related to the sale and marketing of SAS software, those that are related to how to program in SAS, those that have to do with real-world applications of SAS software, and so on.

There have been some successes in building Web search engines that automatically cluster and report on the results. The Northern Light Group (www.northernlight.com) has provided categorized searches of for-fee content for over 10 years, and Vivisimo (clusty.com) has an online version that is becoming an increasingly popular search engine.

In this paper, we demonstrate how the functionality available in SAS Text Miner 3.1 and the new technology available with SAS Text Miner 4.1 (SAS 9.2) enables users to access, understand, and analyze search results. Our focus is separated into the following three parts:

1. Accessing and retrieving the text we wish to analyze.
2. Clustering and labeling search results using a statistical clustering methodology.
3. Clustering and labeling search results using a factorization-based clustering approach.

In the next section, we present some background information about SAS Text Miner. Following the background information, we demonstrate how to access query results from the Yahoo! Web Search service using Base SAS code. The code submits the query, retrieves the search results, and creates a data set out of the query results. In this section, we also create an extension node in SAS Enterprise Miner™, which serves as a user interface for the Yahoo! search engine and outputs a data set that can be used directly by the Text Miner node.

SAS Global Forum 2007, Data Mining and Predictive Modeling
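The idea of clustering and labeling an ambiguous result set, as described in the introduction, can be illustrated with a small sketch. This is not the paper's SAS Text Miner workflow and it does not use matrix factorization; it is a plain-Python approximation that assumes bag-of-words vectors and a simple cosine-based k-means. The snippets, the stopword list, and all function names (`vectorize`, `cluster`, `top_terms`) are hypothetical and were made up for illustration.

```python
import math

# Hypothetical snippets standing in for results of the query "sas".
# They are invented for illustration, not real search engine output.
SNIPPETS = [
    "sas software analytics and data mining",
    "sas airline flights to scandinavia",
    "sas programming data step software",
    "scandinavia airline sas flights booking",
]
# The query term appears in every result, so it carries no signal here.
STOPWORDS = {"and", "to", "the", "of", "sas"}


def vectorize(text):
    """Return a unit-length bag-of-words vector as {term: weight}."""
    counts = {}
    for word in text.lower().split():
        if word not in STOPWORDS:
            counts[word] = counts.get(word, 0) + 1
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {t: c / norm for t, c in counts.items()}


def cosine(a, b):
    """Cosine similarity of two unit-length sparse vectors."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def centroid(vectors):
    """Sum the member vectors and rescale the result to unit length."""
    total = {}
    for v in vectors:
        for t, w in v.items():
            total[t] = total.get(t, 0.0) + w
    norm = math.sqrt(sum(w * w for w in total.values())) or 1.0
    return {t: w / norm for t, w in total.items()}


def cluster(vectors, k, iterations=10):
    """Cosine-based k-means with deterministic farthest-first seeding."""
    centers = [vectors[0]]
    while len(centers) < k:
        # Next seed: the document least similar to all seeds chosen so far.
        centers.append(
            min(vectors, key=lambda v: max(cosine(v, s) for s in centers))
        )
    labels = []
    for _ in range(iterations):
        labels = [
            max(range(k), key=lambda j: cosine(v, centers[j])) for v in vectors
        ]
        updated = []
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            updated.append(centroid(members) if members else centers[j])
        centers = updated
    return labels, centers


def top_terms(center, n):
    """Label a cluster with its n heaviest terms (ties broken by name)."""
    ranked = sorted(center.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:n]]


vectors = [vectorize(s) for s in SNIPPETS]
labels, centers = cluster(vectors, k=2)
for j, center in enumerate(centers):
    members = [s for s, lab in zip(SNIPPETS, labels) if lab == j]
    print(top_terms(center, 2), members)
```

With these toy inputs the software-related and airline-related snippets fall into separate groups, and the heaviest centroid terms serve as a rough label for each group. A real result set would need proper tokenization, term weighting, and a principled choice of the number of clusters.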

Similar resources

STDINFO: From SAS/AF to SAS/IntrNet (Reshma Kakkar and Ray L. Ransom, Centers for Disease Control and Prevention)

Internet/Web-based applications are becoming increasingly popular, not only as a means of giving more users access to data sources but also as a way of integrating them with emerging technologies. The decision, in this case, to move STDINFO, a SAS/AF application, to the Web is based on a need to reach a larger audience and increase usage. The use of Web-based technology allows us to re...

Choice of Development Tool for the User Interface of a Client-Server Application in a SAS® Environment

Application developers in SAS environments regularly face the question about what tool to use to build the client-based user interface of their client-server applications. These are environments where: (1) the data is in server-based SAS data sets, (2) the primary processing is done with server-based SAS Software applications that cover file management, analysis, and reporting, and (3) the user...

SAS Tools for Educational Data Mining

Researchers in the EDM community have always relied on sophisticated tools to analyze data and build models. As the amount of data that can be collected and stored grows, the need for tools capable of handling “big data” becomes ever more prevalent. SAS Analytics U is a new initiative for making SAS data analysis and mining tools available for free to educational researchers and instructors. Th...

Prioritize the ordering of URL queue in Focused crawler

The enormous growth of the World Wide Web in recent years has made it necessary to perform resource discovery efficiently. For a crawler, it is not a simple task to download domain-specific web pages. An unfocused approach often produces undesired results. Therefore, several new ideas have been proposed; a key technique among them is focused crawling, which is able to crawl particular topical...

Using SAS Software to Analyze Sybase Performance on the Web

This paper presents a web-based system that uses SAS, HTML, and CGI/Perl to provide rudimentary and complex Sybase DBMS performance metrics for Unix-based system operations. Sybase SQL Server performance data is collected by Sybase Historical Server, allowing for the collection of performance information with minimal impact on the server. The SAS System (Base SAS, Macro, STAT and SAS/GRAPH) is especi...

Publication date: 2007